RECOMMENDATION OF COMPUTER SYSTEMS
FOR OPERATION OF THE HABITABILITY DATA BASE


1.  Summary  of  Recommendations 

The results of this investigation suggest a number of
courses of action which will be in the best interests of the
Construction Engineering Research Laboratory and their
implementation of the Habitability Data Base. We divide the
recommendations into short and long term based on the amount
of time and effort which will be involved to achieve a
measure of satisfaction of the goals of the HDB effort.

1.1 Short Term Recommendations

a) Reactivate the HDB implementation at the
Computer Services Office of the
University of Illinois. This holding
action will make something available
until mid-1977, while work is underway
to provide a longer-term solution.

b) Implement the HDB files and versions of
the AND/OR, DOCAX, and BIBAX programs
on the Michigan Time Sharing system.

c) Implement SMART on MTS, but do the rest
of the tasks on the UNIX system at CAC.

1.2 Long Term Recommendations


a) Expand the CELDS work at CAC into a form
suitable for both the Environmental IAC
and the Habitability IAC. Pool efforts
with the CELDS group to acquire
appropriate hardware to be used for both
systems.

b) Convert the HDB to a form suitable for
the Lockheed "DIALOG" system and use it
remotely through a Telenet port in
Chicago.


2. Foreword

The recommendations of this report are based on a short term
investigation which precluded hands-on trial of most of the
systems. The reader is cautioned not to base a long-term
expensive information system implementation solely on the results
of examining the features of systems which are available. A more
advisable course of action would be to take steps to contact
vendors of the suggested commercial systems directly to discuss
the suitability of their system for the HDB. It is also
advisable to use their system on existing data bases to gain a
feeling for the kind of response times typically available, the
ease with which a search can be made, and the convenience of
gaining access to the system.

It is not possible to determine the real cost, speed, or
reliability of a system just by looking at the price sheets and
the vendor's description of the system. Surveys and literature
searches can only guide one in deciding which systems deserve a
closer look. They should never be used in the place of hands-on
experience or a carefully documented benchmark to determine the
choice of a system which will be used over an extended period of
time.

No claim is made that all possible systems have been
included in our considerations. Work currently underway at
various research centers may not be in the current literature.
It is suggested that CERL may want to avail themselves of some of
the automated information systems accessible from local sources
to make their own search of the literature to obtain
bibliographies relating to automated information retrieval. Some
of the sources listed in the bibliography of this report
themselves contain extensive bibliographies which probably should
be examined in greater detail.


3.  Approach 

The search for an appropriate recommendation included a
literature survey and an examination of locally known systems
which have the potential to meet the needs of the HDB. The
literature survey included data base management systems and
large-scale interactive information systems. The similarity of
the HDB project to another project currently under investigation
at the Center for Advanced Computation led to a more thorough
comparison of the needs and features of the two systems to
determine whether a recommendation to combine the two efforts was
warranted.

The initial literature survey was directed at data base
management systems, since they would form the backbone of any
information retrieval system which involved managing a new set of
data (the HDB). The essential needs of the HDB are a data base
management system, some type of interactive editing capability,
and the ability to recreate the interactive programs which
provide the AND and OR functions available in the earlier version
of the HDB. The ability to do the ANDOR functions is taken as
given in all systems considered. This task is simply not
sufficiently complex to be a meaningful consideration in any
recommendation. Furthermore, since CERL is usually restricted to
using time sharing services, it was thought that the choices of
computer hardware were restricted to an IBM 360/370 system, a
large CDC system known to be available to CERL, or possibly one
of the larger DEC-10 systems, none of which is renowned for its
built-in capability to manage files effectively. For any of
these, a sophisticated data base management system would be
desirable as a prerequisite to implementing an HDB. The
requirement for managing a large data base was thus taken as a
controlling factor and led to the consideration of data base
management systems.

Further experience with the existing HDB documentation and
discussions with the technical monitor and programmers involved
in the implementation of the SMART system reinforced the notion
that the HDB is not conceptually different from a bibliographic
system, either in the way that data is stored or in the kind of
programs which would be required to respond to the customer's
query. The HDB is essentially a collection of statements, each
having been indexed in a special way, and each referring by some
means to the original document from which it was taken. This is
conceptually similar to a collection of abstracts, indexed by key
words or other searchable fields, and each pointing or referring
to the original bibliographic citation. Section 4 discusses the
needs of the HDB as an information retrieval system.

This conceptual similarity widened the literature search to
include a closer look at the very specialized type of data base
management systems with query languages which can be called
bibliographic systems or interactive information systems.
Through a series of literature sources we were led to a survey
done by the National Bureau of Standards in 1973 (published in
1974). This source constitutes a reference to the technical
features and operational status of interactive information
systems, that is, those providing a "conversational" usage mode
to a "non-programmer" through a data terminal. In addition to
technical information about some 46 systems, it provides guidance
in the use of the index to narrow the field of choices in
selecting an interactive information system for a particular
application. Section 5 discusses the approach suggested by this
reference and goes through a first selection of systems
which meet the needs of the HDB.
Section 6 discusses the appropriateness of SMART as a choice
for implementation, as well as the implications of
continuing with SMART for the next phase of HDB development.
Section 7 discusses some of the issues inherent in abandoning the
SMART program and replacing it with some other system, whether
developed anew or adapted from existing systems.


4. Needs of the Habitability Data Base

We separate the notions of conceptual needs of the HDB from
the current needs in the following way: Conceptual needs are
based on the problem itself, that is, the problem of storing the
HDB and retrieving the information stored therein on the basis of
user requests. Conceptual needs reflect the end user of the HDB.
Current needs, on the other hand, deal with the more practical
immediate concerns of the HDB effort, making the service
available to the current set of users in a cost effective manner
as quickly as possible. The current need seems to be primarily
for a system which will run the ANDOR programs and allow SMART
requests to be submitted in a batch mode. We include in current
needs any system which could do the user's end function as
effectively as SMART, without extensive reprogramming or
reformatting of the data base.


4.1 Conceptual Needs of the Habitability Data Base

Conceptually, the HDB consists of a set of statements drawn
from an appropriate literature. These statements are formulated
by trained specialists who not only condense the information in
the literature, but also classify the information by indexing the
statement. This process is conceptually equivalent to abstracting
a document and providing an index classification of the document.
Whereas most bibliographic retrieval systems keep the information
about the document in the same record as the abstract (and
possibly key words in addition to author, title, etc.), in the HDB
the only information directly linking the entry in HDB with the
original source of the information is a document number encoded
as part of the sequence number field of the HDB statements. The
index of the document is a multidigit string of codes which is
prepended to the first card image of a particular statement.

The user of the HDB wishes to formulate a simple request to
retrieve information which is of immediate concern to him. With
the HDB as originally designed, this request is stated in terms
of the classification of the statement as represented by the
index. This index is composed of 10 coded values [5]:


FUNC...a three digit functional area code
TRFC...a ? digit training facility code
PHYS...a 1 digit physical setting code
AENV...a 2 digit "A" environmental descriptor
BENV...a 2 digit "B" environment descriptor
OCCU...a 1 digit occupant code
PSTR...a 1 digit code for posture of people
INVM...a 1 digit code for involvement of people
ORGF...a 1 digit code for organizational functions
SFCN...a 1 digit code for function of the statement

A more complete description of the classification and
indexing scheme is given in [1].

The primary programs for selecting statements interactively
on the basis of the indexes are the AND and OR programs which run
interactively on the DEC-10 system as part of the Prototype HDB
[4][5][3]. These programs use a rather forced dialog to input
the appropriate fields which are to be searched on and the values
to be searched for. The interactive response is a set of
statements, along with the appropriate document number and the
number of this statement with respect to the source document.
There is no capability to get a count of documents which meet one
criterion or set of criteria and then determine whether that set
should be further limited by ANDing with another set. The entire
request is made at the outset, and the entire set of documents
which match the request is printed as output. There seems to be
no capability of saving the numbers of these documents for later
refinement by further search requests.
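
The selection behavior just described can be sketched in a few
lines. What follows is a minimal illustration in Python, not the
DEC-10 programs themselves: the statement records, the code values,
and the names select_and and select_or are all hypothetical. The
last lines illustrate the two capabilities noted above as missing,
a count before printing and a saved set which a later request
narrows.

```python
# Hypothetical statements: document number, statement number, index codes.
# Field names follow the index description; the code values are invented.
STATEMENTS = [
    {"doc": 101, "stmt": 1, "index": {"FUNC": "210", "OCCU": "1", "PSTR": "2"}},
    {"doc": 101, "stmt": 2, "index": {"FUNC": "210", "OCCU": "3", "PSTR": "1"}},
    {"doc": 205, "stmt": 1, "index": {"FUNC": "340", "OCCU": "1", "PSTR": "2"}},
]


def matches(statement, field, value):
    """True when the statement's index carries the given code in the field."""
    return statement["index"].get(field) == value


def select_and(statements, criteria):
    """Keep statements that satisfy every (field, value) pair."""
    return [s for s in statements
            if all(matches(s, f, v) for f, v in criteria)]


def select_or(statements, criteria):
    """Keep statements that satisfy at least one (field, value) pair."""
    return [s for s in statements
            if any(matches(s, f, v) for f, v in criteria)]


# Report a count first, then narrow the saved set with a further AND.
saved = select_or(STATEMENTS, [("FUNC", "210"), ("FUNC", "340")])
print(len(saved))
refined = select_and(saved, [("OCCU", "1"), ("PSTR", "2")])
print([(s["doc"], s["stmt"]) for s in refined])
```

Keeping the result of one request as an ordinary list is what makes
the later refinement possible; nothing here depends on how the
statements are actually stored.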

Two other programs exist in the Prototype HDB system which
reflect both conceptual and current needs. The function of the
programs is to allow the user to see the bibliographic citation
of a document if he knows the document number and to see the text
of the document if he knows the number. Note that ANDOR returns
the statement and the number of the document. (The document is
not really there; it is just the collection of all statements
which came from that document.)

This technique of finding statements is perhaps appropriate
to some potential users of the HDB. In particular, the person
who wishes to write a criteria manual for design of a certain
training facility might want to retrieve what is available and
related to that kind of facility. However, information
specialists responding to a submitted query, and to some extent
the end customer himself, might find that a better way of
expressing the inquiry and conducting the search is needed. The
HDB does not contain keywords which can be used to characterize
content of statements. (In its present form, content is only
characterized by the index digit string.) Thus a retrieval
system based on full-text search of the statements, preferably
with natural language input, is a second conceptual need of the
HDB.

At the present time this need is met by the collection of
programs known as the SMART system. This system operates in
batch mode on the IBM 360 system. It has been implemented at the
Computer Services Office (University of Illinois at Urbana) as
part of the prototype HDB effort. "The system takes documents and
search requests in English, performs a fully automatic content
analysis of the texts, matches analyzed documents with analyzed
search requests, and retrieves those stored items believed to be
most similar to the queries. Among the language analysis
procedures incorporated into the system are word suffix cutoff
methods, thesaurus lookup procedures, phrase generation methods,
statistical term associations, syntactic analysis, hierarchical
term expansion, and others." [6]
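
To make the idea of matching analyzed documents with analyzed
requests concrete, the following Python sketch ranks statements
against an English request. It is an illustration only and is not
taken from the SMART library: the suffix list, the function names,
and the use of cosine similarity over simple term counts are our
own assumptions, and the thesaurus, phrase, and syntactic
procedures named in the quotation are left out entirely.

```python
import math
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")  # crude, illustrative suffix list


def stem(word):
    """Cut one common suffix, keeping at least three leading letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Turn English text into counts of stemmed, lower-cased terms."""
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    return Counter(stem(w) for w in words)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, statements):
    """Return (similarity, statement) pairs, most similar first."""
    q = analyze(query)
    scored = [(cosine(q, analyze(s)), s) for s in statements]
    return sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])


# Invented example statements, for illustration only.
EXAMPLE = [
    "Lighting levels in training classrooms affect trainees.",
    "Barracks noise disturbs sleeping occupants.",
]
for score, text in rank("classroom lighting", EXAMPLE):
    print(round(score, 3), text)
```

The point of the sketch is that no manual index is consulted: the
request and the statements pass through the same analysis, and the
ranking follows from the overlap of the resulting terms.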

As a part of the SMART user interface for the prototype HDB,
a program on the DEC-10 computer accepts input of query
submittals and formulates batch jobs for the 360. These jobs are
submitted across a link to the batch machine. The user returns
later to see if his job is done and retrieves his output
(responses to his query) by running another program on the
DEC-10. The time lag between request and response has not been
satisfactory with the present implementation.

What is needed and is missing in the current implementation
is an interactive on-line version of SMART. Salton recognized
this as a need [6]. To run SMART interactively would require a
different operating system on the 360, namely one that allows for
time-shared user interactive terminals. There have been no major
updates of SMART since the library was obtained from Cornell for
the prototype HDB. At latest report, no interactive version of
SMART is available in release form, although some effort was
expended at Cornell in implementing an interactive version under
the IBM TSO operating system. Even if that were successful, it
would be of little value to any solution which proposes using the
360 at CSO, since that system will stay batch until its eventual
retirement.

If the conceptual need for natural language processing of a
query is artificial, some of the systems to be mentioned in
Section 5 would probably well serve the needs of the HDB.


4.2 Current Needs of the Habitability Data Base

The current need of the HDB is a system which provides for
the conceptual needs outlined above as well as the more immediate
concerns of finding an appropriate operating system and computer
to run it on. The scope of work for this contract lists five
definitions of the needs of the HDB. Two of these fall into the
class of conceptual needs:

1) the types of programs currently in use must be
   available

2) the system has the capability of handling summary
   data as well as bibliographic and textual data

These  have  been  examined  in  the  section  on  conceptual  needs. 
The  other  needs  are  current  needs  discussed  in  this  section. 

The text-editing capability is desirable so that
corrections and changes can be made to the HDB statements, and so
that new statements can be added as the collection grows. Any
system which will be capable of the interactive access required
for the AND/OR programs will, without exception, have
text-editing capability. So long as the HDB statements are part of a
non-specific text file, they will be accessible and editable with the
editors on most systems.

However, the capability to edit HDB statements which are
already included in a data base which has undergone some degree
of inversion might be somewhat of a problem. The typical
retrieval system requires that the data and the fields which will
be searched be made ready for a large inversion process which is
run against the data base to get it properly organized for faster
retrieval. In some organizations this data base inversion
process is very time consuming. The capability to access the
statements independent of indexes to the statements is thus a
requirement for on-line correction to the HDB statements.
Similarly, in order to keep the data base updated, it should be
possible to input new statements in text form. This is not a
problem. However, it is quite likely that before new statements
can be used as an integral part of the HDB, the inversion process
must be run again. This would restrict updating to periodic
updates of perhaps once a month. This is the norm rather than the
exception in data base systems of the capability described.
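
The consequence of inversion for updating can be illustrated with
a small Python sketch. It is a simplified model under our own
assumptions, not the file organization of any system named in this
report; the class InvertedFile and its methods are hypothetical.
Statements are stored and corrected independently of the index,
while searches see only what the most recent inversion run
produced.

```python
from collections import defaultdict


class InvertedFile:
    """Statements plus an index that is only as fresh as the last inversion."""

    def __init__(self):
        self.statements = {}           # statement id -> {field: code}
        self.index = defaultdict(set)  # (field, code) -> statement ids

    def add(self, stmt_id, fields):
        """Store or correct a statement; the index is NOT updated here."""
        self.statements[stmt_id] = dict(fields)

    def invert(self):
        """Rebuild the whole index from the stored statements."""
        self.index = defaultdict(set)
        for stmt_id, fields in self.statements.items():
            for field, code in fields.items():
                self.index[(field, code)].add(stmt_id)

    def search(self, field, code):
        """Look up statement ids through the index only."""
        return sorted(self.index.get((field, code), set()))


hdb = InvertedFile()
hdb.add(1, {"FUNC": "210"})
hdb.invert()
hdb.add(2, {"FUNC": "210"})        # entered after the last inversion
print(hdb.search("FUNC", "210"))   # statement 2 is not yet searchable
hdb.invert()                       # the periodic update
print(hdb.search("FUNC", "210"))
```

In this model a new statement can be read and edited as soon as it
is entered, but it takes part in retrieval only after the next
inversion, which is the periodic-update behavior described above.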

Commercially available systems will automatically be able to
take care of control and billing of outside users (they make a
living doing it). University computer centers sometimes have
more difficulty with this in that their process for establishing
user accounts is sometimes rather cumbersome. However, the
systems under consideration and outlined in the accompanying
recommendations all meet the criterion that outside users can be
admitted to the system and billed directly. Similarly, the
capability for remote low-speed access from terminals should be
taken as given in all of the systems under discussion here. The
only systems for which this is not the case are systems for which
access is restricted to remote batch, and such a system cannot
meet the editing and interactive requirements. Where necessary,
submission to batch systems should be accomplished via an
interactive system, similar to the technique used between the
DEC-10 and the 360 in the prototype HDB work. This should
always, however, be considered as clumsy and not conducive to the
kind of immediate feedback to be obtained with interactive
systems such as those commercially available.

5.  NBS  Survey  of  Interactive  Information  Systems 

The National Bureau of Standards has already anticipated the
need for government agencies to consider the choice of an
interactive information system. A report published in 1974
constitutes a reference to the technical features and operational
status of such systems available at the time [2]. From the
introduction to that report:

"This report is written for the purpose of
providing Federal ADP customers with information
on a certain class of computer systems which are
capable of handling scientific and technical
information. The report attempts to show what is
available and to characterize these systems in
such a way as to answer questions which naturally
arise prior to selecting such a system for a
particular installation. The report is written at
a level of technical detail which is aimed at
information specialists rather than programming
experts. It is intended to be informative and
instructive, and not critical or evaluative."

"We have reviewed for inclusion in this index
over 200 systems which came to our attention from
various published and unpublished sources as well
as from word-of-mouth. The systems which were
selected conform to the following definition:
"Information Retrieval" or "Data Management"
packages or services which are available to any
Federal ADP installation, and which offer an
interactive query and search capability that is
geared for use by non-programmers."

They eliminated from consideration systems which: 1) are
batch systems, 2) have query languages not for use by
non-programmers, 3) are in research or development, 4) are no longer
supported, 5) are no longer in business or locatable, 6) are
subject to legal or security problems in the way of releasing the
system, or 7) were not documented.

It seems at least strongly suggestive that the systems which
remain meet the basic needs of the US Army CERL, provided that one
of them meets the specific conceptual needs of the HDB.

The intent of this section is to examine the organization of
that report and to frame current concerns in terms of the
selection criteria outlined therein. Table 1 is a list of the
systems which met the criteria for inclusion in this survey.
Table 2 is the questionnaire which was used to characterize the
features of the various systems. Included in the report is a
summary of the features of each of the examined systems, listed
in a manner similar to the format of the questionnaire.

However, before examining each of the systems reported, the
suggestion is made that the needs of potential users of the
system be classified in order to make a first cut at system
selection. Their recommendation for a first elimination is based
upon potential usage and estimated cost first, then on the
availability of a given main-frame, and in the case of a
requirement for a specific data base, on the availability of that
data base as a service. In the case of the HDB investigation,
several choices of main-frame are available, and it has not been
determined whether a package should be put up on one of these
mainframes or a service bureau should be used. Since a decision
can be made on these choices at a later time, we can proceed
immediately to elimination of unsuitable choices based on the
technical features.


Name                Name

BASIS               MARS VI
CDMS                MASTER CONTROL
CIRCOL              MICROTEXT
(Data/Central)      MINIDATA
DIALOG              MIRADS
DMARS               MUSE
DML                 NASIS
DRS                 N.Y. TIMES
DS/3                OLIVER
EMESARI             ORBIT III
ENFORM              PIRETS
FLEXIMIS            QUERY UPDATE
GIM                 RAMTS
GIPSY               RECON
IMARS               RFI
IMS(OEP)            RIQS
IMS/360             SHOEBOX
IMS/8               SOLAR
INQUIRE             SPIRES II
INSYTE              STAIRS
LEADERMART          SYSTEM 2000
MARK IV             TICON
MARS III            UNIDATA

TABLE 1. SYSTEMS INCLUDED IN THE NBS SURVEY

The NBS report suggests drawing distinctions in three broad
classes of system applications: formatted data processing,
structured text searching, and personal text handling. The needs
of the HDB fall into the class of structured text processing. In
the following excerpt from the NBS report, the items in
parentheses refer to the characteristic features listed in the
questionnaire.

"The data files of structured text searching
systems would be expected to be unchanging in
content and very large in volume. It would be
expensive to reorder or restructure them as new
data is received, so it would be desirable for the
system to accept new data in any order (D.3).
Other desirable features would extend content
searching capability, for example by giving a
synonym facility (E.2.d) or a presentation of
other terms that are conceptually related (E.2.e).
As in formatted data processing, tutorial aid is
desirable. In contrast to that application
however, full Boolean capability, optional report
formatting, and optional ordering are suggested
here as desirable rather than essential. Only a
Boolean AND, allowing the conjunction of distinct
search terms, is imperative for user convenience,
to avoid a tedious selection from record subsets
found by individual terms. Optional formatting
and ordering may not be used often for such simple
structured output records as bibliographic
citations. A standard output presentation then is
generally sufficient, unless text fields become
numerous and frequently of marginal importance,
requiring more selectivity to be given the user."
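
Two of the search features named in the excerpt, the Boolean AND
across distinct terms and the synonym facility, can be illustrated
together. The Python sketch below is our own and assumes a
hypothetical synonym table and term index; neither the data nor
the function names come from the NBS report or from any surveyed
system.

```python
# Hypothetical synonym table and term index, invented for illustration.
SYNONYMS = {
    "noise": {"noise", "sound"},
    "barracks": {"barracks", "dormitory"},
}

TERM_INDEX = {
    "noise": {1, 4},
    "sound": {2},
    "barracks": {1, 3},
    "dormitory": {2, 5},
}


def expand(term):
    """Return the term together with any synonyms listed for it."""
    return SYNONYMS.get(term, {term})


def records_for(term):
    """Union of the record sets of a term and its synonyms (an implicit OR)."""
    found = set()
    for word in expand(term):
        found |= TERM_INDEX.get(word, set())
    return found


def search_and(terms):
    """Boolean AND: records that match every search term."""
    sets = [records_for(t) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


print(search_and(["noise", "barracks"]))
```

With the synonym table emptied, the same request would find only
the records indexed under both literal terms, which is the
narrowing the synonym facility is meant to relieve.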

The chart from the NBS report for categorizing systems is
reproduced here as Table 3. Figure 1 shows just the entries
which have an x in the feature row corresponding to structured
text processing systems. This figure shows in a compact format
the choices which on the face of it would be suitable for the HDB
application. Those systems marked with a "+" are listed
specifically as allowing customer data bases to be added to a
"service" system. Also, two systems are included on this table
which are not mentioned in the NBS report. These are the CELDS
system and the EUREKA system currently in some stage of
development at the Urbana campus of the University of Illinois.
The SMART system is also indicated on this chart, though it does
not meet the criteria required for inclusion in the NBS survey.
Tables 4 and 5 characterize the SMART and CELDS system features,
while Table 6 characterizes the "ideal" system for an HDB.

6. The Choice of an Appropriate HDB System

The process of finding an appropriate information retrieval
system to be used for the HDB reduces first to the decision of
whether or not natural language queries are necessary and then,
if they are necessary, to deciding whether SMART is satisfactory.
If it comes close to being satisfactory, then one can consider
how it can be implemented so that it is available to CERL.


6.1  The  Choice  of  Natural  Language  Query 

There is a seemingly unanswered question of whether or not
natural language inquiry and fully automated language analysis
procedures are effective in a document retrieval environment such
as the HDB. Note here that we are assuming that the HDB task is
equivalent to document retrieval in the sense that the statement
content is similar to abstracts. However, in the techniques now
being used for the HDB the title, author, and other citation-type
information are not used. The only thing used is the text of
the statement itself; the index is only used in the ANDOR
approach.

This is an important question largely because of the
research which tends to cast doubts on the consistency of
manually prepared document analysis. Salton reports in a number
of research studies that automated language analysis procedures
can provide benefits. One of the major results of a recent study
was that although simple word extraction followed by Boolean
search does not produce retrieval results equivalent in
effectiveness to standard manual indexing techniques, a variety
of techniques can be added to obtain retrieval whose
effectiveness exceeds conventional manual methodologies. When
these factors are added to the expense of preparing the index and
thesauri by hand, the argument to stay with automated techniques
becomes stronger.

One wonders openly whether the choice to implement the HDB
in the way it now appears was made with full understanding of the
implications of this ever-expanding body of research or merely as
a result of the convenience, or even the personal bias, of one of
the early workers on the project. Certainly the CERL HDB
managers have a choice between two courses of action. One is to
use a manual indexing technique coupled with a manually prepared
thesaurus or set of key words. Given this choice several of the
commercially available and tested retrieval systems could be
adapted and the HDB would be searched with techniques similar to
those now successfully being used to search the major document
data bases in use today (NTIS, ERIC, Chem Abstracts, etc.). The
other choice is to continue with the more forward-looking but
less proven techniques of automatic content analysis with natural
language queries as represented by the SMART system. The system
in use today by the HDB lies somewhere in between the two
extremes, since laborious indexing is done as the statements are
prepared, and Boolean searches of a sort are done on the basis of
these indexes. But this is complemented by running SMART, which
does not make use of the indexes at all.


6.2  Suitability  of  SMART 

There  are  some  questions  and  reservations  about  the  use  of 
SMART  as  a  major  tool  for  the  HDB. 

First of all, the SMART implementation requires a very large core
region on a 360 system. The current implementation was intended as
a vehicle for experimental work in information retrieval
techniques. As a result, much of the code is concerned with
measuring retrieval performance. Considerable portions of the code
loaded from the SMART library are never actually executed. A
production version could conceivably be produced which would not
include as many measuring tools and thus could be somewhat smaller.
One current goal of the Cornell group is a modular implementation,
so that one could load only the modules necessary for a production
environment. An alternate solution would be to implement the code
(which is primarily Fortran with some assembly language
subroutines) on a virtual memory operating system.

Secondly, the current implementation is strictly a batch system.
An attempt has been made at Cornell to implement SMART under TSO,
but that work now appears to have fallen by the wayside. It seems
apparent that an interactive system would make it easier for the
user to modify his searching strategy based upon what he is
finding, rather than submitting a number of batch jobs, all of
which must do the complete search. As an information system to be
used by information specialists, this major inconvenience might be
overcome, but in our view it is unlikely ever to be regularly used
by customers directly in this mode.

Some thought should be given to why commercially available systems
are not offering automatic content analysis and natural language
queries in quite the same way that SMART attempts to do. The
systems which are available commercially seem to be universally
built on some variation of key word searching and boolean
expressions for search requests. The commercial systems have a long
(up to 10 years) period of development behind them. When these
efforts started, natural language processing was not sufficiently
developed to make it worth the commercial risk. Some would argue
that it is still not worth the risk. The fact that so many
commercial systems use key words tends to suggest that the
technology is accepted and that a long period of support can be
envisioned. The implication of all of this for the HDB is that if
what is needed must feature natural language queries with no
manual preparation of key words, a non-commercial,
semi-experimental system is the only choice. However, if the
current HDB can be expanded (either by hand or with programs) to
include key words, one of the commercially available systems will
provide reliable long term service of a less sophisticated nature.
It may even be that simple full text searching of the statements
themselves, using a controlled thesaurus, could be used on one of
the commercial systems.

These conclusions should not preclude pursuing the goal of
interactive language query systems. To whatever extent this
capability is crucial to the long term goals of the COE, it should
be pursued as an adjunct to systems like the HDB. However, a
completely adequate job of data storage and retrieval in support
of an Information Access Center (IAC) for the HDB can be done with
commercially available systems. Unfortunately, some backtracking
will be necessary to associate appropriate keywords with each of
the habitability statements if that course is taken.


6.3  The  Choice  of  Continuing  With  SMART

One possible course of action is to continue using the SMART
system in its present form. This can be done with or without the
concurrent use of the package of programs loosely associated with
AND/OR. Options that are directly available to CERL at the present
time include continuing with the CSO installation and running the
software that is now available, transferring the SMART system to
the Amdahl installation at the University of Michigan, or
transferring it to the IBM 360/91 at UCLA. Of course, it is always
possible to put the system on some nationally available
time-sharing service, but that would cause some (solvable)
difficulty with the local availability of printouts.

The DEC-10 and 360/75 installation at the Computing Services
Office (CSO) of the University of Illinois is expected to stay
available in its present form only through the middle of 1977. The
present indication is that the DEC-10 system will be taken out of
service at that time. The general expectation is that the 360 will
stay in service through the middle of 1978, because of the demand
by university users for whom conversion will be impossible before
that time. Thus, there need be no rush to bring up a different
system if one is willing to tolerate the long turn-around time for
SMART jobs. Some of this turn-around time is a result of having to
ask that the disk with the HDB be mounted each time it is needed.
Requesting that the disk be permanently mounted would reduce that
delay but produce some small increase in costs.

Another course of action, if the choice is to stay with the SMART
system, is to move the library to the installation at the
University of Michigan at Ann Arbor. At least two projects at CERL
are currently using the University of Michigan system with
apparently good results. If the choice is the Michigan system, then
the next choice is what to do about the AND and OR programs.
However, the very nature of the Michigan Time Sharing (MTS) system
provides a useful solution. MTS is a superior time sharing system
which allows interactive access from user programs in a rather
general way. The system has an impressive collection of
interactive services, including a good editor, document
preparation systems, and convenient handling of large disk files.
It would be possible to recode the AND and OR programs, as well as
the programs which allow one to see the documents (statements) and
bibliographic entries. The programming could all be done from CERL
with interactive terminals. Terminal access to MTS can be directly
by FTS line, or can be arranged in the same way that some other
projects at CERL employ: they dial a phone port at CAC which is
attached to a multiplexer, the other end of which is a port on the
MTS system.

The multiplexer equipment now in use is available as excess
capacity on a system installed by CAC for another project. That
project is currently expected to continue at least through January
of 1977. The excess capacity of this line is expected to be
available so long as that project continues to be funded, which is
expected to be for more than another year. In the worst case, that
in which the CAC project no longer needs the access to MTS, it
would only require four regular users at CERL to justify pooling
costs to put in this equipment themselves, each paying the same
rate that is currently being paid by CERL users for this service,
and have the multiplexer strictly for CERL use. The total budget
for the multiplexer connection is on the order of $800 per month.
As few as four projects could use such a communications system to
keep their total costs below long distance access costs.

Remote job entry from the Unix system at CAC is now available.
Files of card images are transmitted to MTS from Unix disk files in
a manner similar to HASP work stations. Printed output from UM is
available on the equipment at CAC. This service is expected to
continue so long as the line to Michigan is needed and there are
funds to support it. Charges for use of the local system come as a
separate bill from the computer charges assessed at Michigan. The
communications cost is currently billed as a fixed monthly cost
for the use of the multiplexer and associated phones.

The costs at UM are said to be reasonable, according to the CAC
users of MTS. The only noticeable startup costs for going to this
solution would be the costs associated with sending the HDB files
to Michigan, and the costs of reprogramming the programs other
than SMART which are needed to continue the present mode of
operation. However, our experience with MTS indicates that the
level and reliability of the service at Michigan warrant its
serious consideration.

Still another avenue is open to facilitate staying with the SMART
system without being concerned with the continuing availability of
the CSO 360. The Campus Computing Network (CCN) of UCLA is
available on the ARPANET and can be accessed as easily from any
ARPANET node as it can from the CAC. The time sharing system there
is TSO, which certainly does not compare to MTS in terms of its
friendliness to the user. However, large batch jobs can be run at
CCN, and printouts can be returned to the printer at CAC. The
charges for this connection could probably be kept to on the order
of $!> per hour connected to the network, plus the normal user
fees at CCN. The 360/91 installation at CCN is one of the more
reliable places we have come in contact with over the last two
years.

The interactive portions of the HDB tasks would have to be recoded
under the TSO system at CCN. However, since they are now coded in
Fortran, a mere conversion would suffice to make the system as
usable in that environment as it is in its present environment.
Costs there are comparable to costs at the University of Illinois,
except that in our experience jobs which require a large region
size (as SMART does) are generally cheaper to run at CCN. Also,
because there is more core on the CCN system, large jobs can be run
at any time of day, and the turn-around time is generally better
than for a comparable large job on the 360/75 at CSO. The
processor there is also much faster and, as a result, the wait for
results of a query should be much shorter.

In either the University of Michigan or the UCLA situation, the
disadvantage of SMART being a strictly batch system would still
apply. However, with the appropriate cooperation of the
originators of SMART at Cornell, either of these systems would be
suitable for converting SMART into an interactive system. This
would be no small undertaking. It could not (or should not) be
done without the active cooperation of the group at Cornell who
are intimately familiar with the inner workings of SMART. The
software development for such a task would conservatively take
about a year, for a total of about two to two and one-half
man-years of programming.

As a purely batch system, SMART could be installed on one of the
nationally available time-sharing systems which offer IBM
equipment. If the remote job entry equipment at CERL is at some
point attached to such a service, it would be easy to move a copy
of the SMART library there and run just the SMART system as pure
batch jobs. If the system also supports time-sharing service, the
AND and OR programs could be recoded just as they would have to be
with any of the other choices.


7.  Developing  a  Replacement  for  SMART 

While the SMART system in its present implementation is not quite
satisfactory for the production stages of the HDB effort, careful
consideration must be given to any proposals to change systems at
this stage of development. Certainly a change from the
implementation on the IBM 360 and DEC-10 system at CSO is going to
be necessary, because those systems are scheduled to be phased out
of service over the next two years. Section 6 discussed some of
the issues which must be addressed in an information retrieval
system suitable for the HDB, but outlined the options available to
CERL if their decision were to stay with the SMART program.

The assumption in this section is that a decision has been made to
abandon the SMART programs and develop or find something else.
Given that assumption, two avenues of investigation are open. One
is to develop the CELDS system, which is performing a similar
function for environmental data bases in conjunction with another
group at CERL. The other is to make the necessary modifications to
the HDB to make the information retrievable using one of the
nationally available information retrieval systems. The DIALOG
system at Lockheed is given as an example because it comes the
closest to meeting the criteria outlined in Section 5.

7.1  Revision  of  CELDS  for  the  HDB 

Any initial implementation of HDB on a CELDS-copy retriever would
have to include at least the capabilities that ANDOR, BIBAX, and
DOCAX already provide to HDB users. CELDS provides these options
now, and in addition provides:


1)  all functions combined into one retrieval language

2)  SAVE for interesting and often used output sets

3)  HELP

4)  partial search, which tells the user how many statements
    satisfy sub-expressions

5)  parentheses and full expression nesting

6)  OOPS to return to the previous statement-set

7)  multiple values per field

8)  off-line printing

9)  simple logon-logoff

10) fast response time even for very large databases
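
Two of the listed features, nested expressions and the partial search
count for each sub-expression, can be illustrated together. The
following Python sketch is not CELDS code: the tuple representation
of a query, the operator names, and the sample index are all
assumptions made for the example.

```python
def evaluate(expr, index, report):
    """Evaluate a nested boolean expression against an inverted index.

    expr is either a term (a string) or a tuple (op, left, right) where
    op is "AND", "OR", or "NOT" ("NOT" meaning left but not right).
    The hit count of every sub-expression is appended to report, in the
    order the sub-expressions are resolved.
    """
    if isinstance(expr, str):
        result = index.get(expr, set())
    else:
        op, left, right = expr
        lhs = evaluate(left, index, report)
        rhs = evaluate(right, index, report)
        if op == "AND":
            result = lhs & rhs
        elif op == "OR":
            result = lhs | rhs
        elif op == "NOT":
            result = lhs - rhs
        else:
            raise ValueError(f"unknown operator: {op}")
    report.append((expr, len(result)))
    return result


# Invented index: term -> ids of the statements carrying that term.
index = {"barracks": {1, 3, 4}, "noise": {2, 3}, "lighting": {1, 4}}

# barracks AND (noise OR lighting)
query = ("AND", "barracks", ("OR", "noise", "lighting"))
report = []
hits = evaluate(query, index, report)
counts = [count for _, count in report]
```

Reporting the count at each level lets a user see which part of a
query is too broad or too narrow before asking for any output.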

To convert to a field-oriented system (like CELDS), the HDB could
be broken into the following fields:

ACC  - accession number
DOC  - document number
STMT - statement number
DATE - date published, researched, input
       [unknown for current database]
BIB  - bibliographic data
AUTH - name of author(s) [unknown for current DB]
FUNC - functional area code
TRFC - training facility code
PHYS - physical settings
ENV  - environmental descriptors (however many apply)
OCCU - occupants
PSTR - posture
INVM - involvement
ORGF - organizational functions
SFCN - function of statement
TEXT - the text of the statement
KEY  - keywords [unknown for the current DB]

Several new values would have to be added to the SFCN field,
including "objectives", "data", and "procedure". Several of the
fields (such as PSTR and INVM) could be dropped and their values
used as KEYWORDS. This would help streamline the list of fields
without loss of generality. The DATE field is a useful field to
include, but not strictly necessary. The only non-searchable
fields would be BIB, TEXT, and DATE.
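
The proposed field layout can be pictured as a simple record
structure. The sketch below, in Python, uses the field names from the
list above; the representation of a record as a dictionary of value
lists, the function names, and the sample values are assumptions for
illustration only.

```python
# Field names as proposed in the text, in the order listed there.
FIELDS = [
    "ACC", "DOC", "STMT", "DATE", "BIB", "AUTH", "FUNC", "TRFC", "PHYS",
    "ENV", "OCCU", "PSTR", "INVM", "ORGF", "SFCN", "TEXT", "KEY",
]

# Fields the text identifies as non-searchable.
NON_SEARCHABLE = {"BIB", "TEXT", "DATE"}


def make_record(**values):
    """Build a record holding a list of values for every field."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {name: list(values.get(name, [])) for name in FIELDS}


def searchable_terms(record):
    """List the (field, value) pairs that an inversion would index."""
    return [
        (name, value)
        for name in FIELDS
        if name not in NON_SEARCHABLE
        for value in record[name]
    ]


# Invented sample record; ENV shows several values in one field.
record = make_record(
    ACC=["1"],
    DOC=["12"],
    STMT=["3"],
    ENV=["noise", "lighting"],
    SFCN=["objectives"],
    TEXT=["Barracks noise should be limited at night."],
)
terms = searchable_terms(record)
```

Fields that are unknown for the current data base (DATE, AUTH, KEY)
simply hold empty lists until values become available.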

CELDS-like format includes one line per field, and each line is
prefixed by accession number and field number. The current HDB
lines are suffixed by statement number, card number, and document
number, and separated (unnecessarily) by 'NEXT TEXT' cards.
Converting data formats would be fairly simple, except that a few
desirable fields would be missing [e.g. keywords] and the current
HDB uses digit strings for the indexes. Names would be much easier
for novice users to read. The digit strings could be converted to
names automatically.
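
A conversion of this kind might look like the following Python
sketch. The exact CELDS line layout is not given in this report, so
the column widths, the field numbering, and the code-to-name table
below are hypothetical; only the general idea (one line per field,
prefixed by accession number and field number, with digit codes
replaced by names) comes from the text.

```python
# Hypothetical field numbering and code table, for illustration only.
FIELD_NUMBERS = {"DOC": 1, "STMT": 2, "FUNC": 3, "TEXT": 4}
FUNC_NAMES = {"03": "dining", "07": "sleeping"}


def to_field_lines(accession, record):
    """Emit one line per field value: accession, field number, value."""
    lines = []
    ordered = sorted(FIELD_NUMBERS.items(), key=lambda item: item[1])
    for name, number in ordered:
        for value in record.get(name, []):
            if name == "FUNC":
                # Replace a digit-string index by a readable name if known.
                value = FUNC_NAMES.get(value, value)
            lines.append(f"{accession:06d} {number:02d} {value}")
    return lines


# Invented statement record in dictionary form.
record = {
    "DOC": ["12"],
    "STMT": ["3"],
    "FUNC": ["03"],
    "TEXT": ["Dining halls should be quiet."],
}
lines = to_field_lines(41, record)
```

Codes missing from the table pass through unchanged, so an incomplete
table degrades the readability of the output but loses no data.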

Two CELDS input programs would have to be modified slightly (made
more general) to accommodate the different field names. The CELDS
retriever program would also have to be modified to use the new
fields. The inversion program would have to be run on the newly
created Habitability Data Base.

The next obvious improvements would include adding keywords to the
database, and adding an on-line thesaurus to the retriever. Then
the combined retriever could be modified to use the thesaurus to
recognize concepts in a very SMART-like environment. Concept
numbers and weighting are not currently practical for interactive
searching, but this could make a fascinating research project.
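
One simple way a retriever could use a thesaurus to recognize
concepts is to expand each query word to every term filed under the
same concept before searching. The Python sketch below shows that
idea only; it is not how SMART or CELDS works internally, and the
thesaurus entries, index, and function names are invented.

```python
# Hypothetical thesaurus: concept name -> terms treated as equivalent.
THESAURUS = {
    "noise": {"noise", "sound", "acoustics"},
    "lighting": {"lighting", "illumination"},
}


def expand(word, thesaurus):
    """Return every term sharing a concept with word, or just the word."""
    for terms in thesaurus.values():
        if word in terms:
            return set(terms)
    return {word}


def search_concept(word, index, thesaurus):
    """Ids of statements indexed under any term of the word's concept."""
    hits = set()
    for term in expand(word, thesaurus):
        hits |= index.get(term, set())
    return hits


# Invented inverted index: term -> statement ids.
index = {"noise": {2, 3}, "acoustics": {5}, "illumination": {1}}
sound_hits = search_concept("sound", index, THESAURUS)
```

Here a query on "sound" retrieves statements indexed under "noise"
and "acoustics" even though no statement contains the query word
itself, which is the benefit an exact-match search cannot give.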


7.2  Using  a  Commercial  Information  Retrieval  System 

One of the more popular and widely used of the commercially
available information retrieval systems is the DIALOG system
operated by Lockheed in Palo Alto, California. This system was
included in the survey discussed in Section 5. If a commercially
available system is considered as a home for the HDB, certainly
DIALOG should be considered a prime candidate.

The decision to move to a commercial retrieval system presents
questions both of a technical nature and of a purely operational
nature. We address both kinds of questions, but from the very
limited basis of the specific information which is available to us
in the course of this investigation. We consider first the
technical questions of what would be required to put the HDB into
the DIALOG system.

Putting the HDB into DIALOG would require almost exactly the same
amount of effort as putting it into CELDS format. DIALOG is a
field-oriented system with full text searching ability, but not
natural language query. The HDB would almost certainly have to be
converted to a DIALOG format, and keywords should be added. DIALOG
would require a very complete thesaurus, which would then be
available on-line. Full text searching in DIALOG requires an exact
match to the words in the statement.

The DIALOG system works primarily with searches on predefined
fields. Although the system is designed for bibliographic
retrieval, the similarity to the information in the HDB suggests
that only a small perturbation of the HDB would be required for
conversion to DIALOG. The fields in the index used with the HDB
statements could be made into fields in the DIALOG sense. The
statements in the HDB are similar to abstracts and thus could be
treated by DIALOG in the same way abstracts are treated. The task
of converting what now exists in HDB to a form suitable for DIALOG
could be assisted by some of the text processing capability in the
UNIX system at CAC.

One conceptual dissimilarity between the two systems is that in
DIALOG all information of one record (or set of records) concerns
a single document, and there is no field to refer to a parent
document. In HDB, on the other hand, the basic information is a
statement, several of which come from the same parent document. It
would be possible to think of each HDB statement as a document in
the DIALOG sense, provided that an extra field is added to give
reference to the parent document. Also, in this context, it would
probably be advisable to encode keywords for each of the HDB
statements. Other information would be based on the parent
document. This would probably need to include the author or some
other reference to the source, the date if that applies, the
corporate author if one exists, and inevitably, the key words for
the parent document.
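
The arrangement described above, one record per statement with an
extra field naming the parent document, can be sketched as follows.
This Python example does not reflect any actual DIALOG record format;
the identifiers, field names, and sample data are invented.

```python
# Invented parent documents and statements, for illustration only.
documents = {
    "D12": {"author": "Smith", "date": "1974", "keywords": ["barracks"]},
}
statements = [
    {"id": "D12-3", "parent": "D12", "keywords": ["noise"],
     "text": "Barracks noise should be limited at night."},
    {"id": "D12-4", "parent": "D12", "keywords": ["lighting"],
     "text": "Lighting in the barracks should be adjustable."},
]


def with_parent_data(statement, documents):
    """Return a copy of a statement record with parent-level fields added."""
    parent = documents[statement["parent"]]
    record = dict(statement)
    record["author"] = parent["author"]
    record["date"] = parent["date"]
    record["parent_keywords"] = list(parent["keywords"])
    return record


def statements_from(parent_id, statements):
    """Ids of all statements drawn from one parent document."""
    return [s["id"] for s in statements if s["parent"] == parent_id]


record = with_parent_data(statements[0], documents)
siblings = statements_from("D12", statements)
```

Keeping statement keywords separate from parent keywords preserves
the distinction the text draws between the two levels of indexing.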

Another pressing need, in the event of this choice as well as
several others, is for a completed thesaurus for the HDB. The
approach taken in the thesaurus for the early portions of the HDB
is a step in the right direction, but it needs to be expanded to
include terms peculiar to the whole range of habitability
statements, not just the limited subset available to CERL now.
With a standardized HDB vocabulary and careful keywording, DIALOG
could be a fast, easy to use system for retrieving from the HDB.

It is impossible to determine the cost of putting the HDB on
DIALOG, or to estimate what it would cost to run, except by
comparing the complexity of HDB to some of the other available
databases for which at least representative user costs are
available. The cost for accessing the NTIS data base, for example,
is SZb per connect hour. (The system is purely interactive.) In
addition to this, a communication cost is added, dependent upon
the mode of access. For access via Telenet this charge is $8 per
hour. Since the HDB is considerably smaller than NTIS, one would
expect the charge to be less, except for the fact that fewer
customers might mean higher prices.

The documentation of DIALOG makes it very clear that they will not
be able to predict the cost for a new database and the
accompanying service to access it. Such an estimate could be
nothing but a raw guess without an extremely detailed proposal
from Lockheed. One immediate suggestion is that Lockheed should be
contacted, given as much information as possible about the HDB,
including this report, and then asked to submit a cost proposal.
Sales brochures for DIALOG indicate the price for out of the
ordinary services as "negotiable."

Although we have no idea what it costs to put up the NTIS
information on DIALOG, it seems clear that one of the
justifications is the wide interest in accessing the NTIS
database, and thus the customer base with which to recover the
installation costs. For a special purpose client like CERL the
cost cannot be spread over so many customers, and thus the
apparent cost will seem higher. In order to operate an IAC which
includes the capability to search the HDB for clients, CERL would
thus have to pass on fairly high operational costs to the client,
or operate the service at a loss until the number of clients
spreads the cost out over a wider base of users.

There is another whole question, still unanswered, as to whether
Lockheed would even be interested in putting HDB on their system.
Certainly the customer base at the present time would not warrant
their covering the cost of transforming the HDB into a form
suitable for DIALOG. CERL would either have to do that themselves
or pay Lockheed to do it. Now it is certainly true that Lockheed
is intended to be a profit making venture, and thus they may be
willing to put whatever someone wants onto their system for an
appropriately large sum of money. However, it may be that their
growth plans do not allow for yet another potentially large data
base to come on the scene in the near future. If this is true,
they will not be able to put the HDB database on DIALOG,
regardless of whether or not they could recover their costs for
doing so. We were able to contact DIALOG users, and use DIALOG
on-line. The DIALOG users we called were largely pleased with
Lockheed service.